Statistical Learning (Machine Learning)
Classification

Module 07

Ray J. Hoobler

Libraries

Code
library(tidyverse)
library(pROC)
library(class)
library(broom)

An Overview of Classification

Widely-used Classifiers

  • Logistic Regression
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • Naive Bayes
  • K-Nearest Neighbors

Why Not Linear Regression?

Example: Challenger O-Ring Data

UCI Machine Learning Repository

(Draper 1993)

Number of Attributes: 5

 1. Number of O-rings at risk on a given flight  
 2. Number experiencing thermal distress  
 3. Launch temperature (degrees F)  
 4. Leak-check pressure (psi)  
 5. Temporal order of flight  
Code
variables <- c("total_num", "oring_distress", "launch_temp", "leak_pressure", "flight_order")

oring_data <- "datasets/challenger+usa+space+shuttle+o+ring/o-ring-erosion-or-blowby.data"

shuttle <- read_table(oring_data, col_names = variables) |> 
  mutate(distress_binary = if_else(oring_distress > 0, 1, 0))

Example: Challenger O-Ring Data

Code
shuttle

Example: Challenger O-Ring Data Plots

Linear Regression

Code
shuttle |> 
  ggplot(aes(x = launch_temp, y = distress_binary)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Challenger O-Ring Data",
       subtitle = "Predicted probabilities lie outside [0, 1]",
       x = "Launch Temperature (°F)",
      y = "Probability of Distress") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(50, 90, by = 5))

Logistic Regression

Code
shuttle |> 
  ggplot(aes(x = launch_temp, y = distress_binary)) +
  geom_point() +
  stat_smooth(method="glm", method.args = list(family="binomial"), se = FALSE) +
  labs(title = "Challenger O-Ring Data",
       subtitle = "Predicted probabilities lie between 0 and 1",
       x = "Launch Temperature (°F)",
       y = "Probability of Distress") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(50, 90, by = 5))

Logistic Regression

Logistic Regression (1/7)

Logistic regression models the probability of O-ring distress given the launch temperature:


\[ \text{Pr}(\text{Distress} = \text{Yes} | \text{Temperature}) \]

Logistic Regression (2/7)

We initially tried a linear regression model to represent the probability of distress.

\[ p(X) = \beta_0 + \beta_1 X \]

The linear model produced values outside the [0, 1] range.

Logistic Regression (3/7)

The logistic regression model uses the logistic function to model the probability of distress.


\[ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]


This function produces values between 0 and 1, and we can easily plot it with ggplot.

Code
# Create a data frame with x values
df <- tibble(x = c(-5, 5))

# Create the plot
ggplot(data = df, aes(x = x)) +
  geom_function(
    fun = function(x) 1 / (1 + exp(-x)), linewidth = 1
    ) +
  labs(title = "Logistic Function",
       x = "X",
       y = "Probability") +
  theme_minimal()

Logistic Regression (4/7)

Starting with the logistic function:

\[ p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

After some algebra:

\[ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \]

The quantity on the LHS is called the odds.

Taking the natural log of both sides:

\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \]

The quantity on the LHS is called the log-odds or logit. The logistic regression model above has a logit that is linear in \(X\).
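
A quick numeric check (a sketch with hypothetical coefficients \(\beta_0 = -10\) and \(\beta_1 = 0.2\)) confirms that the log-odds of the logistic function is exactly linear in \(X\):

Code
# Hypothetical coefficients, chosen only for illustration
beta0 <- -10
beta1 <- 0.2
x <- seq(40, 80, by = 10)

# Probabilities from the logistic function
p <- 1 / (1 + exp(-(beta0 + beta1 * x)))

log(p / (1 - p))   # log-odds computed from p(X)
beta0 + beta1 * x  # identical: the logit is linear in x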

Logistic Regression (5/7)

Estimating the Regression Coefficients

The logistic regression model is estimated using the maximum likelihood method.

The likelihood function is:

\[ \mathcal{L}(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} (1 - p(x_{i'})) \]

Taking the natural logarithm, we get the log-likelihood function:

\[ \ell(\beta_0, \beta_1) = \sum_{i:y_i=1} \log(p(x_i)) + \sum_{i':y_{i'}=0} \log(1 - p(x_{i'})) \]

Maximum Likelihood Estimation

The goal is to find \(\beta_0\) and \(\beta_1\) that maximize \(\ell(\beta_0, \beta_1)\). This is typically done by:

  1. Taking the partial derivatives of \(\ell(\beta_0, \beta_1)\) with respect to \(\beta_0\) and \(\beta_1\).

  2. Setting these derivatives to zero and solving the resulting equations.

  3. Since these equations are nonlinear, iterative methods or gradient descent are used to find the solution.

The solution \((\hat{\beta_0}, \hat{\beta_1})\) that maximizes \(\ell(\beta_0, \beta_1)\) is the maximum likelihood estimate of the parameters.
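
As a sketch (not part of the original analysis), step 3 can be carried out numerically with optim() on the shuttle data; the estimates should be close to the glm() coefficients on the next slide, although the default optimizer may need extra iterations for an exact match.

Code
# Negative log-likelihood for simple logistic regression
neg_log_lik <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Minimizing the negative log-likelihood maximizes the log-likelihood
mle_fit <- optim(par = c(0, 0), fn = neg_log_lik,
                 x = shuttle$launch_temp, y = shuttle$distress_binary)

mle_fit$par  # compare with the glm() coefficients below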

Logistic Regression (6/7)

Fit the model

Code
logit_model <- glm(distress_binary ~ launch_temp, data = shuttle, family = "binomial")

summary(logit_model)

Call:
glm(formula = distress_binary ~ launch_temp, family = "binomial", 
    data = shuttle)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  15.0429     7.3786   2.039   0.0415 *
launch_temp  -0.2322     0.1082  -2.145   0.0320 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.267  on 22  degrees of freedom
Residual deviance: 20.315  on 21  degrees of freedom
AIC: 24.315

Number of Fisher Scoring iterations: 5

Calculate the predicted probability of distress at 31°F

Code
predict(logit_model, 
        newdata = tibble(launch_temp = 31), 
        type = "response")
        1 
0.9996088 
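
As a sanity check, plugging the fitted coefficients into the logistic function by hand reproduces this value:

Code
# Manual calculation with the fitted coefficients
b <- coef(logit_model)
1 / (1 + exp(-(b[1] + b[2] * 31)))  # approximately 0.9996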

Logistic Regression (7/7)

Shuttle O-Ring Data

Code
shuttle |> 
  ggplot(aes(x = launch_temp, y = distress_binary)) +
  geom_point() +
  stat_smooth(method="glm", method.args = list(family="binomial"), se = FALSE) +
  labs(title = "Challenger O-Ring Data",
       subtitle = "Predicted probabilities lie between 0 and 1",
       x = "Launch Temperature (°F)",
       y = "Probability of Distress") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(50, 90, by = 5))

Multiple Logistic Regression

Similar to the linear regression model, we can include multiple predictors in the logistic regression model.

\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p \]

Expressed in terms of \(p(X)\):

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}} \]

Example: Wine Quality Data (1/7)

UCI Machine Learning Repository

(Cortez and Reis 2009)

Code
wine_file <- "datasets/wine+quality/winequality-red.csv"

wine_df <- read_delim(wine_file, delim = ";", col_names = TRUE, show_col_types = FALSE) |> 
  mutate(quality = as_factor(quality))

# wine_df

wine_df |> 
  ggplot() +
  geom_bar(aes(x = quality), fill = "skyblue", color = "black") +
  theme_light()

Example: Wine Quality Data (2/7)

Recode the quality variable

Code
wine_df2 <- wine_df |> 
  mutate(quality_binary = if_else(as.numeric(quality) + 2 >= 7, 1, 0)) # as.numeric() on a factor returns level indices, not the original scores; with quality levels 3-8 the "+2" recovers the score

wine_df2

Example: Wine Quality Data (3/7)

Create test train splits

Code
set.seed(42)
train_wine_index <- sample(1:nrow(wine_df2), 0.7 * nrow(wine_df2))

train_wine_data <- wine_df2[train_wine_index, ] |> 
  select(-quality)

test_wine_data <- wine_df2[-train_wine_index, ] |> 
  select(-quality)

Example: Wine Quality Data (4/7)

Fit the logistic regression model to the training data

Code
logit_model_wine <- glm(quality_binary ~ ., data = train_wine_data, family = "binomial")

summary(logit_model_wine)

Call:
glm(formula = quality_binary ~ ., family = "binomial", data = train_wine_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             1.383e+02  1.324e+02   1.044 0.296414    
`fixed acidity`         1.992e-01  1.516e-01   1.314 0.188808    
`volatile acidity`     -3.099e+00  9.724e-01  -3.187 0.001437 ** 
`citric acid`           2.592e-01  1.030e+00   0.252 0.801333    
`residual sugar`        1.970e-01  9.038e-02   2.179 0.029300 *  
chlorides              -1.134e+01  5.162e+00  -2.198 0.027980 *  
`free sulfur dioxide`   2.786e-02  1.613e-02   1.727 0.084093 .  
`total sulfur dioxide` -2.754e-02  7.131e-03  -3.862 0.000112 ***
density                -1.502e+02  1.353e+02  -1.110 0.266958    
pH                     -1.397e-01  1.222e+00  -0.114 0.908994    
sulphates               3.370e+00  6.972e-01   4.833 1.34e-06 ***
alcohol                 7.706e-01  1.581e-01   4.874 1.09e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 874.3  on 1118  degrees of freedom
Residual deviance: 603.1  on 1107  degrees of freedom
AIC: 627.1

Number of Fisher Scoring iterations: 6

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.


               True: 0   True: 1
Predicted: 0      TN        FN
Predicted: 1      FP        TP

Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\)

Precision = \(\frac{TP}{TP + FP}\)

Recall = \(\frac{TP}{TP + FN}\)

F1 Score = \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
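
A quick arithmetic example with made-up counts (TN = 90, FP = 10, FN = 20, TP = 30) shows how the formulas fit together:

Code
# Hypothetical counts, used only to exercise the formulas above
TN <- 90; FP <- 10; FN <- 20; TP <- 30

(TP + TN) / (TP + TN + FP + FN)     # accuracy  = 0.80
TP / (TP + FP)                      # precision = 0.75
TP / (TP + FN)                      # recall    = 0.60
2 * (0.75 * 0.60) / (0.75 + 0.60)   # F1 score  ≈ 0.667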


Measures for Classification and Diagnostic Testing

  • False Positive Rate (FPR): \(\frac{FP}{FP + TN}\). Synonyms: Type I error rate, 1 - specificity.
  • True Positive Rate (TPR): \(\frac{TP}{TP + FN}\). Synonyms: 1 - Type II error rate, power, sensitivity, recall.
  • Positive Predictive Value (PPV): \(\frac{TP}{TP + FP}\). Synonyms: precision, 1 - false discovery proportion.
  • Negative Predictive Value (NPV): \(\frac{TN}{TN + FN}\). Synonyms: 1 - false omission proportion.

Example: Wine Quality Data (5/7)

Evaluate the model on the test data

Code
predictions_wine <- predict(logit_model_wine, newdata = test_wine_data, type = "response")
predicted_classes_wine <- ifelse(predictions_wine > 0.5, 1, 0)

conf_matrix_wine <- table(Predicted = predicted_classes_wine, Actual = test_wine_data$quality_binary)
accuracy_wine <- sum(diag(conf_matrix_wine)) / sum(conf_matrix_wine)
precision_wine <- conf_matrix_wine[2,2] / sum(conf_matrix_wine[2,])
recall_wine <- conf_matrix_wine[2,2] / sum(conf_matrix_wine[,2])
f1_score_wine <- 2 * (precision_wine * recall_wine) / (precision_wine + recall_wine)

print(paste("Accuracy:", round(accuracy_wine, 4)))
[1] "Accuracy: 0.8833"
Code
print(paste("Precision:", round(precision_wine, 4)))
[1] "Precision: 0.697"
Code
print(paste("Recall:", round(recall_wine, 4)))
[1] "Recall: 0.3333"
Code
print(paste("F1 Score:", round(f1_score_wine)))
[1] "F1 Score: 0"

Placing Model Metrics in a Data Frame

Code
wine_metrics_df <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1 Score"),
  Value = c(accuracy_wine, precision_wine, recall_wine, f1_score_wine)
) |> 
  mutate(Value = round(Value, 4))

wine_metrics_df

Example: Wine Quality Data (6/7)

Generate a ROC curve

Code
predicted_probs <- predict(logit_model_wine, newdata = test_wine_data, type = "response")

roc_obj <- roc(test_wine_data$quality_binary, predicted_probs, quiet = TRUE)

optimal_cutoff <- coords(roc_obj, "best", best.method = "closest.topleft")

roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  labs(
    title = "ROC Curve for Wine Quality Data",
    subtitle = "Multiple Logistic Regression Model",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_light() +
  annotate("text", x = 0.8, y = 0.2, 
           label = paste("AUC =", round(auc(roc_obj), 3))) +
 annotate("text", x = 0.8, y = 0.1,
           label = paste("Optimal cutoff =", round(optimal_cutoff$threshold, 3)))

Metrics for the ROC curve

The receiver operating characteristic (ROC) curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) for different cutoff (threshold) values. The true positive rate is also known as the sensitivity; the false positive rate equals 1 - specificity.

\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \]

\[ \text{False Positive Rate} = \frac{FP}{FP + TN} \]
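
As a sketch (reusing predicted_probs and test_wine_data from the code above), the ROC curve can also be built by hand by sweeping the threshold and computing the TPR and FPR at each value:

Code
# Manual ROC curve: sweep the cutoff and compute TPR/FPR at each value
actual <- test_wine_data$quality_binary

manual_roc <- map(seq(0, 1, by = 0.01), function(cutoff) {
  pred <- if_else(predicted_probs > cutoff, 1, 0)
  tibble(cutoff = cutoff,
         TPR = sum(pred == 1 & actual == 1) / sum(actual == 1),
         FPR = sum(pred == 1 & actual == 0) / sum(actual == 0))
}) |> list_rbind()

ggplot(manual_roc, aes(x = FPR, y = TPR)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  theme_light()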

Example: Wine Quality Data (7/7)

Use the optimal cutoff to calculate a new confusion matrix and evaluation metrics

Code
predicted_classes_wine_optimal <- ifelse(predictions_wine > optimal_cutoff$threshold, 1, 0)

conf_matrix_wine_optimal <- table(Predicted = predicted_classes_wine_optimal, Actual = test_wine_data$quality_binary)

accuracy_wine_optimal <- sum(diag(conf_matrix_wine_optimal)) / sum(conf_matrix_wine_optimal)
precision_wine_optimal <- conf_matrix_wine_optimal[2,2] / sum(conf_matrix_wine_optimal[2,])
recall_wine_optimal <- conf_matrix_wine_optimal[2,2] / sum(conf_matrix_wine_optimal[,2])
f1_score_wine_optimal <- 2 * (precision_wine_optimal * recall_wine_optimal) / (precision_wine_optimal + recall_wine_optimal)

print(paste("Accuracy (Optimal Cutoff):", round(accuracy_wine_optimal, 4)))
[1] "Accuracy (Optimal Cutoff): 0.8"
Code
print(paste("Precision (Optimal Cutoff):", round(precision_wine_optimal, 4)))
[1] "Precision (Optimal Cutoff): 0.4029"
Code
print(paste("Recall (Optimal Cutoff):", round(recall_wine_optimal, 4)))
[1] "Recall (Optimal Cutoff): 0.8116"
Code
print(paste("F1 Score (Optimal Cutoff):", round(f1_score_wine_optimal)))
[1] "F1 Score (Optimal Cutoff): 1"

Confusion Matrix for Wine Quality Data

Default Threshold

Code
conf_matrix_wine
         Actual
Predicted   0   1
        0 401  46
        1  10  23

Optimal ROC Threshold

Code
conf_matrix_wine_optimal
         Actual
Predicted   0   1
        0 328  13
        1  83  56

K-Nearest Neighbors

K-Nearest Neighbors Analysis of Wine Data

K-nearest neighbors is a non-parametric method that classifies an observation by the majority class of its \(k\) nearest neighbors.

We can apply the KNN algorithm to the wine quality data.
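
As a toy sketch (made-up two-predictor data, not part of the wine analysis), class::knn() assigns each test point the majority class among its \(k\) nearest training points:

Code
# Two well-separated clusters labeled A and B
toy_train <- tibble(x1 = c(1.0, 1.2, 0.9, 5.0, 5.1, 4.8),
                    x2 = c(1.0, 0.8, 1.1, 5.0, 4.9, 5.2))
toy_labels <- factor(c("A", "A", "A", "B", "B", "B"))
toy_test <- tibble(x1 = c(1.1, 4.9), x2 = c(1.0, 5.0))

# With k = 3, each test point takes the majority vote of its 3 nearest neighbors
knn(train = toy_train, test = toy_test, cl = toy_labels, k = 3)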

K-Nearest Neighbors Analysis of Wine Data

Code
train_wine_data_knn <- train_wine_data |>  select(-quality_binary)
test_wine_data_knn <- test_wine_data |>  select(-quality_binary)

knn_pred <- knn(train = train_wine_data_knn, 
                 test = test_wine_data_knn, 
                 cl = train_wine_data$quality_binary, 
                 k = 5)

conf_matrix_knn <- table(Predicted = knn_pred, Actual = test_wine_data$quality_binary)

accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
precision_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[2,])
recall_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[,2])
f1_score_knn <- 2 * (precision_knn * recall_knn) / (precision_knn + recall_knn)

print(paste("Accuracy (KNN):", round(accuracy_knn, 4)))
[1] "Accuracy (KNN): 0.8521"
Code
print(paste("Precision (KNN):", round(precision_knn, 4)))
[1] "Precision (KNN): 0.475"
Code
print(paste("Recall (KNN):", round(recall_knn, 4)))
[1] "Recall (KNN): 0.2754"
Code
print(paste("F1 Score (KNN):", round(f1_score_knn)))
[1] "F1 Score (KNN): 0"

Confusion Matrix for K-Nearest Neighbors

Code
conf_matrix_knn
         Actual
Predicted   0   1
        0 390  50
        1  21  19

Tuning Hyperparameters

How do we choose the best value for \(k\)?

“Loop” over different values of \(k\) and evaluate the model accuracy and precision.

Code
train_wine_data_knn <- train_wine_data |> select(-quality_binary)
test_wine_data_knn <- test_wine_data |> select(-quality_binary)

k_values <- seq(1, 19, 2)

knn_metrics <- map(k_values, function(k) {
                     knn_pred <- knn(train = train_wine_data_knn, 
                                     test = test_wine_data_knn, 
                                     cl = train_wine_data$quality_binary, 
                                     k = k)
    
  conf_matrix_knn <- table(Predicted = knn_pred, Actual = test_wine_data$quality_binary)
    
  accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
  precision_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[2,])
  recall_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[,2])
  f1_score_knn <- 2 * (precision_knn * recall_knn) / (precision_knn + recall_knn)
    
  tibble(
      k = k,
      Accuracy = accuracy_knn,
      Precision = precision_knn,
      Recall = recall_knn,
      F1_Score = f1_score_knn)
  }
  ) |> list_rbind()

# print results
knn_metrics
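
A quick plot (a sketch using the knn_metrics tibble above) makes the comparison across \(k\) easier to see:

Code
knn_metrics |> 
  pivot_longer(-k, names_to = "metric", values_to = "value") |> 
  ggplot(aes(x = k, y = value, color = metric)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = k_values) +
  labs(x = "k", y = "Metric value") +
  theme_light()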

Precision, Recall, and F1 Score

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

  • Precision is the proportion of positive cases that were correctly identified by the model.
  • Recall is the proportion of actual positive cases that were correctly identified by the model.
  • F1 Score is the harmonic mean of precision and recall.

\(k = 1\) has the highest recall and F1 score, which are probably the most important metrics in this context. (We should always be aware of the potential for overfitting the data.)

Generative Models for Classification

Generative Models

  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Naive Bayes

A Comparison of Classification Methods

A Comparison of Classification Methods (ISLR2, p. 161)

  • Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that \(n\) is very large and \(p\) is small.
  • In order to provide accurate classification, KNN requires a lot of observations relative to the number of predictors—that is, \(n\) much larger than \(p\). This has to do with the fact that KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
  • In settings where the decision boundary is non-linear but \(n\) is only modest, or \(p\) is not very small, then QDA may be preferred to KNN. This is because QDA can provide a non-linear decision boundary while taking advantage of a parametric form, which means that it requires a smaller sample size for accurate classification, relative to KNN.
  • Unlike logistic regression, KNN does not tell us which predictors are important: we don’t get a table of coefficients.

Generalized Linear Models (Poisson Regression)

Poisson Regression of “Bike Sharing” Data

(Fanaee-T 2013)

(Included here as it is another example where linear regression is not appropriate.)

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, is automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used to sense mobility in the city. Hence, it is expected that many of the important events in the city could be detected by monitoring these data.

Read and Review the Data

Code
bike_day_file <- "datasets/bike+sharing+dataset/hour.csv"
bike_share_hour <- read_csv(bike_day_file, col_names = TRUE)

bike_share_hour

Select 2011 Data and Plot

Code
bike_share_hour |> 
  filter(yr == 0) |>
  ggplot(aes(x = hr, y = cnt)) +
  geom_point(color = "skyblue", position = position_jitter(width = 0.1), alpha = 1/3) +
  geom_smooth(method = "lm",formula = y ~ x ,se = FALSE) +
  # geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) +
  # geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.7) +
  labs(
    title = "Bike Sharing Data (2011)",
    x = "Hour of the Day",
    y = "Count of Bikes Rented"
  ) +
  theme_light()

Poisson Regression
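
Poisson regression models a count response. With the standard log link used by glm(family = "poisson"), the log of the expected count is linear in the predictors:

\[ \log\left(\lambda(X)\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

In the fit below, hr enters as a factor, so each coefficient compares that hour with the reference hour (hr = 0).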

Code
bike_share_hour_poisson <- bike_share_hour |> 
  filter(yr == 0) |> 
  mutate(hr = as.factor(hr))

poisson_model <- glm(cnt ~ hr, data = bike_share_hour_poisson, family = "poisson")

summary(poisson_model)

Call:
glm(formula = cnt ~ hr, family = "poisson", data = bike_share_hour_poisson)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.762295   0.008022  469.01   <2e-16 ***
hr1         -0.483265   0.012999  -37.18   <2e-16 ***
hr2         -0.821901   0.014645  -56.12   <2e-16 ***
hr3         -1.453588   0.018840  -77.15   <2e-16 ***
hr4         -2.077436   0.024793  -83.79   <2e-16 ***
hr5         -1.080652   0.016071  -67.24   <2e-16 ***
hr6          0.291584   0.010603   27.50   <2e-16 ***
hr7          1.292880   0.009051  142.85   <2e-16 ***
hr8          1.809838   0.008650  209.23   <2e-16 ***
hr9          1.336799   0.009009  148.39   <2e-16 ***
hr10         1.112019   0.009241  120.33   <2e-16 ***
hr11         1.287031   0.009056  142.11   <2e-16 ***
hr12         1.485279   0.008877  167.32   <2e-16 ***
hr13         1.487314   0.008875  167.58   <2e-16 ***
hr14         1.445238   0.008910  162.20   <2e-16 ***
hr15         1.476453   0.008884  166.19   <2e-16 ***
hr16         1.695506   0.008719  194.45   <2e-16 ***
hr17         2.094714   0.008496  246.55   <2e-16 ***
hr18         2.013104   0.008538  235.78   <2e-16 ***
hr19         1.703100   0.008718  195.35   <2e-16 ***
hr20         1.391299   0.008959  155.29   <2e-16 ***
hr21         1.140324   0.009209  123.82   <2e-16 ***
hr22         0.880108   0.009534   92.31   <2e-16 ***
hr23         0.474563   0.010206   46.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  413848  on 8621  degrees of freedom
AIC: 466934

Number of Fisher Scoring iterations: 5

Poisson Regression

Plot of coefficients

Code
poisson_coefs <- tidy(poisson_model) |> 
  filter(term != "(Intercept)")

ggplot(poisson_coefs, aes(x = seq_along(estimate), y = estimate)) +
  geom_point(color = "blue") +
  geom_line(color = "blue") +
  labs(
    title = "Poisson Regression Coefficients",
    x = "Hour of the Day",
    y = "Coefficient"
  ) +
  theme_light()
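
Because the coefficients are on the log scale, exponentiating them gives rate ratios relative to the reference hour. A minimal sketch using broom:

Code
# Rate ratios: exp(coefficient) compares each hour's expected count to hour 0
tidy(poisson_model) |> 
  filter(term != "(Intercept)") |> 
  mutate(rate_ratio = exp(estimate)) |> 
  select(term, estimate, rate_ratio)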

Poisson Regression

Plot mean of predicted values

Code
# mean of predicted values for each hour
predicted_values <- predict(poisson_model, newdata = bike_share_hour_poisson, type = "response")

predicted_values_mean <- bike_share_hour_poisson |> 
  mutate(predicted_values = predicted_values) |> 
  group_by(hr) |> 
  summarize(mean_predicted_values = mean(predicted_values)) |> 
  ungroup()

bike_share_hour |> 
  filter(yr == 0) |> 
  ggplot() +
  geom_point(aes(x = hr, y = cnt), color = "skyblue", position = position_jitter(width = 0.1), alpha = 1/3) +
  geom_line(aes(x = as.numeric(hr) - 1, y = mean_predicted_values), data = predicted_values_mean) +
  labs(
    title = "Mean of Predicted Values (Poisson Regression)",
    x = "Hour of the Day",
    y = "Mean of Predicted Values"
  ) +
  theme_light()

End of Module 7

References

Cortez, Paulo, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Wine Quality.” UCI Machine Learning Repository.
Draper, David. 1993. “Challenger USA Space Shuttle O-Ring.” UCI Machine Learning Repository.
Fanaee-T, Hadi. 2013. “Bike Sharing.” UCI Machine Learning Repository.
“Mushroom.” 1987. UCI Machine Learning Repository.

Example: Mushroom Data

UCI Machine Learning Repository

From the Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible. (Mushroom 1987)

Code
mushroom_file <- "datasets/mushroom/agaricus-lepiota.data"
mushroom_names <- c("class", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_attachment", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above_ring", "stalk_surface_below_ring", "stalk_color_above_ring", "stalk_color_below_ring", "veil_type", "veil_color", "ring_number", "ring_type", "spore_print_color", "population", "habitat")

mushroom_df <- read_csv(mushroom_file,
                        na = "?",
                        col_names = mushroom_names, 
                        col_types = cols(.default = col_factor())) |> 
  mutate(poisonous = as_factor(if_else(class == "p", 1, 0))) |> 
  relocate(poisonous, .after = class) |> 
  select(-class, -veil_type, -cap_color, -odor, -gill_color, -stalk_color_above_ring, -stalk_color_below_ring,
         -veil_color, -ring_type)

# mushroom_df |>
#   group_by(poisonous, odor) |>
#   count() |>
#   ggplot() +
#   geom_col(aes(odor, n, fill = poisonous), position = "dodge")

mushroom_df

Multiple Logistic Regression (1/)

Code
# # Load necessary libraries
# library(caret)
# library(pROC)
# 
# # Assuming your dataframe is called 'mushroom_df'
# # and the target variable is called 'poisonous' (1 for poisonous, 0 for edible)
# 
# # Split the data into training and testing sets
# set.seed(42)  # for reproducibility
# train_index <- createDataPartition(mushroom_df$poisonous, p = 0.7, list = FALSE)
# train_data <- mushroom_df[train_index, ]
# test_data <- mushroom_df[-train_index, ]
# 
# # Fit the logistic regression model
# model <- glm(poisonous ~ ., data = train_data, family = binomial)
# 
# # Summary of the model
# summary(model)
# 
# # Make predictions on the test set
# predictions <- predict(model, newdata = test_data, type = "response")
# 
# # Convert probabilities to binary predictions
# predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# 
# # Evaluate the model
# conf_matrix <- table(Predicted = predicted_classes, Actual = test_data$poisonous)
# accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# precision <- conf_matrix[2,2] / sum(conf_matrix[2,])
# recall <- conf_matrix[2,2] / sum(conf_matrix[,2])
# f1_score <- 2 * (precision * recall) / (precision + recall)
# 
# # Print evaluation metrics
# print(paste("Accuracy:", accuracy))
# print(paste("Precision:", precision))
# print(paste("Recall:", recall))
# print(paste("F1 Score:", f1_score))
# 
# # Plot ROC curve
# roc_obj <- roc(test_data$poisonous, predictions)
# plot(roc_obj, main = "ROC Curve")
# auc_value <- auc(roc_obj)
# print(paste("AUC:", auc_value))
# 
# # Feature importance (based on absolute value of coefficients)
# feature_importance <- abs(coef(model))[-1]  # Exclude intercept
# feature_importance <- sort(feature_importance, decreasing = TRUE)
# print("Feature Importance:")
# print(feature_importance)